Abstract

A sequence $x_1, \dots, x_n, \dots$ of discrete-valued observations is generated according to some unknown
probabilistic law (measure) $\mu$. After observing each outcome, it is required to give the conditional
probabilities of the next observation. The realizable case is when the measure $\mu$ belongs to an arbitrary
but known class $C$ of process measures. The non-realizable case is when $\mu$ is completely arbitrary, but the
prediction performance is measured with respect to a given set $C$ of process measures. We are interested
in the relations between these problems and between their solutions, as well as in characterizing the cases
when a solution exists, and finding these solutions. We show that if the quality of prediction is measured
by total variation distance, then these problems coincide, while if it is measured by expected average KL
divergence, then they are different. For some of the formalizations we also show that when a solution
exists, it can be obtained as a Bayes mixture over a countable subset of $C$. As an illustration to the
general results obtained, we show that a solution to the non-realizable case of the sequence prediction
problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary
processes.

The framework is completely general: the process measures considered are not required to be i.i.d.,
mixing, stationary, or to belong to any parametric family.

1 Introduction

A sequence $x_1, \dots, x_n, \dots$ of discrete-valued observations ($x_i \in X$, $X$ is finite) is generated according to
some unknown probabilistic law (measure) $\mu$. That is, $\mu$ is a probability measure on the space $\Omega = (X^\infty, \mathcal{B})$
of one-way infinite sequences (here $\mathcal{B}$ is the usual Borel $\sigma$-algebra). After each new outcome $x_n$ is revealed,
it is required to predict conditional probabilities of the next observation $x_{n+1} = a$, $a \in X$, given the past
$x_1, \dots, x_n$. Since a predictor $\rho$ is required to give conditional probabilities $\rho(x_{n+1} = a \mid x_1, \dots, x_n)$ for all
possible histories $x_1, \dots, x_n$, it defines itself a probability measure on the space $\Omega$ of one-way infinite sequences.
In other words, a probability measure on $\Omega$ can be considered both as a data-generating mechanism and as
a predictor.

Therefore, given a set $C$ of probability measures on $\Omega$, one can ask two kinds of questions about it. First,
does there exist a predictor $\rho$ whose forecast probabilities converge (in a certain sense) to the $\mu$-conditional
probabilities, if an arbitrary $\mu \in C$ is chosen to generate the data? Here we assume that the "true" measure
that generates the data belongs to the set $C$ of interest, and would like to construct a predictor that predicts
all measures in $C$. The second type of question is as follows: does there exist a predictor that predicts at
least as well as any predictor $\rho \in C$, if the measure that generates the data comes possibly from outside of
$C$? Here we consider elements of $C$ as predictors, and we would like to combine their predictive
properties, if this is possible. Note that in this setting the two questions above concern the same object: a
set $C$ of probability measures on $\Omega$.

Each of these two questions, the realizable and the non-realizable one, has enjoyed much attention in the
literature; the setting for the non-realizable case is usually slightly different, which is probably why it has not
(to the best of the author's knowledge) been studied as another facet of the realizable case. The realizable
case traces back to Laplace, who considered the problem of predicting outcomes of a series of independent
tosses of a biased coin. That is, he considered the case when the set C is that of all i.i.d. process measures.



Other classical examples studied are the set of all computable (or semi-computable) measures [14], the set 
of $k$-order Markov and finite-memory processes (e.g. [5]) and the set of all stationary processes [8]. The
general question of finding predictors for an arbitrary given set C of process measures has been addressed in 
[11, 12, 10]; the latter work shows that when a solution exists it can be obtained as a Bayes mixture over a 
countable subset of C. 

The non-realizable case is usually studied in a slightly different, non-probabilistic, setting. We refer to [3] 
for a comprehensive overview. It is usually assumed that the observed sequence of outcomes is an arbitrary 
(deterministic) sequence; it is required not to give conditional probabilities, but just deterministic guesses 
(although these guesses can be selected using randomisation). Predictions result in a certain loss, which is 
required to be small as compared to the loss of a given set of reference predictors (experts) C. The losses 
of the experts and the predictor are observed after each round. In this approach, it is mostly assumed that 
the set C is finite or countable. The main difference with the formulation considered in this work is that 
we require a predictor to give probabilities, and thus the loss is with respect to something never observed 
(probabilities, not outcomes). The loss itself is not completely observable in our setting. In this sense our 
non-realizable version of the problem is more difficult. Assuming that the data generating mechanism is
probabilistic, even if it is completely unknown, makes sense in such problems as, for example, game playing,
or market analysis. In these cases one may wish to assign smaller loss to those models or experts who give
probabilities closer to the correct ones (which are never observed), even though different probability forecasts
can often result in the same action. Aiming at predicting probabilities of outcomes also allows us to abstract 
from the actual use of the predictions (e.g. making bets) and thus from considering losses in a general form; 
instead, we can concentrate on the form of losses (measuring the discrepancy between the forecast and true 
probabilities) which are more convenient for the analysis. In this latter respect, the problems we consider are 
easier than those considered in prediction with expert advice. (However, in principle nothing restricts us to 
considering the simple losses that we chose; they are just a convenient choice.) It is also worth noting that the
probabilistic approach makes the machinery of probability theory applicable, hopefully making the problem easier.

In this work we consider two measures of the quality of prediction. The first one is the total variation 
distance, which measures the difference between the forecast and the "true" conditional probabilities of all 
future events (not just the probability of the next outcome). The second one is expected (over the data) 
average (over time) Kullback-Leibler divergence. Requiring that predicted and true probabilities converge in 
total variation is very strong; in particular, this is possible if [1] and only if [4] the process measure generating 
the data is absolutely continuous with respect to the predictor. The latter fact makes the sequence prediction 
problem relatively easy to analyse. Here we investigate what can be paralleled for the other measure of 
prediction quality (average KL divergence), which is much weaker, and thus allows for solutions for the cases 
of much larger sets C of process measures (considered either as predictors or as data generating mechanisms).

Having introduced our measures of prediction quality, we can further break the non-realizable case into 
two problems. The first one is as follows. Given a set C of predictors, we want to find a predictor whose 
prediction error converges to zero if there is at least one predictor in C whose prediction error converges to 
zero; we call this problem simply the "non-realizable" case, or Problem 2 (leaving the name "Problem 1" to
the realizable case). The second problem is the "fully agnostic" problem: it is to make the prediction error 
asymptotically as small as that of the best (for the given process measure generating the data) predictor in 
C (we call this Problem 3). Thus, we now have three problems about a set of process measures C to address. 

We show that if the quality of prediction is measured in total variation, then all the three problems 
coincide: any solution to any one of them is a solution to the other two. For the case of expected average 
KL divergence, all the three problems are different: the realizable case is strictly easier than non-realizable 
(Problem 2), which is, in turn, strictly easier than the fully agnostic case (Problem 3). We then analyse 
which results concerning prediction in total variation can be transferred to which of the problems concerning 
prediction in average KL divergence. It was shown in [10] that, for the realizable case, if there is a solution 
for a given set of process measures C, then a solution can also be obtained as a Bayesian mixture over a 
countable subset of C; this holds both for prediction in total variation and in expected average KL divergence. 
Here we show that this result also holds true for the (non-realizable) case of Problem 2, for prediction in 
expected average KL divergence. For the fully agnostic case of Problem 3, we show that separability with 



respect to a certain topology given by KL divergence is a sufficient (though not a necessary) condition for 
the existence of a predictor. This is used to demonstrate that there is a solution to this problem for the set 
of all finite-memory process measures, complementing similar results obtained earlier in different settings. 
On the other hand, we show that there is no solution to this problem for the set of all stationary process 
measures, in contrast to a result of [8] which gives a solution to the realizable case of this problem (that is, 
a predictor whose expected average KL error goes to zero if any stationary process is chosen to generate the 
data). 

2 Preliminaries 

Let $X$ be a finite set. The notation $x_{1..n}$ is used for $x_1, \dots, x_n$. We consider stochastic processes (probability
measures) on $\Omega := (X^\infty, \mathcal{B})$, where $\mathcal{B}$ is the sigma-field generated by the cylinder sets $[x_{1..n}]$, $x_i \in X$, $n \in \mathbb{N}$,
and $[x_{1..n}]$ is the set of all infinite sequences that start with $x_{1..n}$. For a finite set $A$ denote $|A|$ its cardinality.
We use $E_\mu$ for expectation with respect to a measure $\mu$.

Next we introduce the measures of the quality of prediction used in this paper. For two measures $\mu$ and
$\rho$ we are interested in how different the $\mu$- and $\rho$-conditional probabilities are, given a data sample $x_{1..n}$.
Introduce the (conditional) total variation distance

$$v(\mu, \rho, x_{1..n}) := \sup_{A \in \mathcal{B}} |\mu(A \mid x_{1..n}) - \rho(A \mid x_{1..n})|,$$

if $\mu(x_{1..n}) \ne 0$ and $\rho(x_{1..n}) \ne 0$, and $v(\mu, \rho, x_{1..n}) = 1$ otherwise.

Definition 1. We say that $\rho$ predicts $\mu$ in total variation if

$$v(\mu, \rho, x_{1..n}) \to 0 \quad \mu\text{-a.s.}$$

This convergence is rather strong. In particular, it means that $\rho$-conditional probabilities of arbitrary
far-off events converge to $\mu$-conditional probabilities. Moreover, $\rho$ predicts $\mu$ in total variation if [1] and only
if [4] $\mu$ is absolutely continuous with respect to $\rho$. Denote $\ge_{tv}$ the relation of absolute continuity (that is,
$\rho \ge_{tv} \mu$ if $\mu$ is absolutely continuous with respect to $\rho$).

Thus, for a class $C$ of measures there is a predictor $\rho$ that predicts every $\mu \in C$ in total variation if and
only if every $\mu \in C$ has a density with respect to $\rho$. Although such sets of processes are rather large, they
do not include even such basic examples as the set of all Bernoulli i.i.d. processes. That is, there is no
$\rho$ that would predict in total variation every Bernoulli i.i.d. process measure $\delta_p$, $p \in [0, 1]$, where $p$ is the
probability of 0. Therefore, perhaps for many (if not most) practical applications this measure of the quality
of prediction is too strong, and one is interested in weaker measures of performance.

For two measures $\mu$ and $\rho$ introduce the expected cumulative Kullback-Leibler divergence (KL divergence)
as

$$d_n(\mu, \rho) := E_\mu \sum_{t=1}^{n} \sum_{a \in X} \mu(x_t = a \mid x_{1..t-1}) \log \frac{\mu(x_t = a \mid x_{1..t-1})}{\rho(x_t = a \mid x_{1..t-1})}.$$

In words, we take the expected (over data) cumulative (over time) KL divergence between $\mu$- and $\rho$-conditional
(on the past data) probability distributions of the next outcome.

Definition 2. We say that $\rho$ predicts $\mu$ in expected average KL divergence if

$$\frac{1}{n} d_n(\mu, \rho) \to 0.$$
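As an illustration of Definition 2, the following minimal sketch evaluates the expected cumulative KL divergence directly from its definition for one concrete pair of measures. The choices are assumptions made for the example only: the data source is an i.i.d. Bernoulli measure with probability `p` of the symbol 0, and the predictor is Laplace's rule of succession, which forecasts 0 with probability (k+1)/(m+2) after seeing k zeros among m symbols. The function names are hypothetical, and natural logarithms are used.

```python
from math import exp, lgamma, log


def binom_pmf(m: int, k: int, p: float) -> float:
    """P(k zeros among m i.i.d. symbols), computed in log space for stability."""
    log_pmf = (lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)
               + k * log(p) + (m - k) * log(1 - p))
    return exp(log_pmf)


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence (in nats) between next-symbol laws (p, 1-p) and (q, 1-q)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))


def cumulative_kl(p: float, n: int) -> float:
    """d_n(mu, rho) for mu = i.i.d. Bernoulli with P(0) = p, rho = Laplace's rule.

    For an i.i.d. source the past matters only through the number of zeros
    seen so far, so the expectation over pasts is a binomial sum.
    """
    total = 0.0
    for t in range(1, n + 1):      # predicting x_t from x_{1..t-1}
        m = t - 1                  # length of the observed past
        for k in range(m + 1):     # k = number of zeros seen so far
            q = (k + 1) / (m + 2)  # Laplace's forecast that the next symbol is 0
            total += binom_pmf(m, k, p) * kl_bernoulli(p, q)
    return total


def average_kl(p: float, n: int) -> float:
    """The quantity d_n / n whose convergence to 0 Definition 2 asks for."""
    return cumulative_kl(p, n) / n


if __name__ == "__main__":
    for n in (10, 100, 400):
        print(n, average_kl(0.3, n))
```

For this pair the printed averages shrink as n grows, which is the behaviour Definition 2 requires; the example says nothing about other classes of measures.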



This measure of performance is much weaker, in the sense that it requires good predictions only one 
step ahead, and not on every step but only on average; also the convergence is not with probability 1 
but in expectation. With prediction quality so measured, predictors exist for relatively large classes of 



measures; most notably, [8] provides a predictor which predicts every stationary process in expected average 
KL divergence. We will use the following well-known identity 

$$d_n(\mu, \rho) = -\sum_{x_{1..n} \in X^n} \mu(x_{1..n}) \log \frac{\rho(x_{1..n})}{\mu(x_{1..n})}, \tag{2}$$

where on the right-hand side we have simply the KL divergence between measures $\mu$ and $\rho$ restricted to the
first n observations.
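The identity can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: the source is a hypothetical first-order Markov chain on {0, 1} that repeats the previous symbol with probability 0.8, the predictor is Laplace's rule of succession, and all names are made up for the example. It computes the expected cumulative KL divergence once from the step-by-step definition and once as the divergence between the two measures restricted to strings of length n.

```python
from itertools import product
from math import log


def mu_next(past, a):
    """Hypothetical source: Markov chain repeating the last symbol w.p. 0.8."""
    if not past:
        return 0.5
    return 0.8 if a == past[-1] else 0.2


def rho_next(past, a):
    """Laplace's rule of succession, used here as the predictor."""
    count = sum(1 for s in past if s == a)
    return (count + 1) / (len(past) + 2)


def block_prob(next_prob, xs):
    """Probability of the finite string xs as a product of conditionals."""
    prob = 1.0
    for t in range(len(xs)):
        prob *= next_prob(xs[:t], xs[t])
    return prob


def dn_stepwise(n):
    """Expected cumulative next-symbol KL divergence, from the definition."""
    total = 0.0
    for t in range(1, n + 1):
        for past in product((0, 1), repeat=t - 1):
            weight = block_prob(mu_next, past)  # probability of this past
            for a in (0, 1):
                m = mu_next(past, a)
                total += weight * m * log(m / rho_next(past, a))
    return total


def dn_block(n):
    """KL divergence between the measures restricted to length-n strings."""
    total = 0.0
    for xs in product((0, 1), repeat=n):
        m = block_prob(mu_next, xs)
        total -= m * log(block_prob(rho_next, xs) / m)
    return total


if __name__ == "__main__":
    print(dn_stepwise(8), dn_block(8))
```

The two values agree up to floating-point error, as the chain rule for KL divergence predicts.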

Thus, the results of this work will be established with respect to two very different measures of prediction 
quality, one of which is very strong and the other rather weak. This suggests that the facts established reflect 
some fundamental properties of the problem of prediction, rather than those pertinent to particular measures 
of performance. On the other hand, it remains open to extend the results below to different measures of 
performance. 

Definition 3. Introduce the following classes of process measures: $\mathcal{P}$ the set of all process measures, $\mathcal{D}$ the
set of all degenerate discrete process measures, $\mathcal{S}$ the set of all stationary processes, and $\mathcal{M}_k$ the set of all
stationary measures with memory not greater than $k$ ($k$-order Markov processes, with $\mathcal{M}_0$ being the set of
all i.i.d. processes):

$$\mathcal{D} := \{\mu \in \mathcal{P} : \exists x \in X^\infty \ \mu(x) = 1\}, \tag{3}$$

$$\mathcal{S} := \{\mu \in \mathcal{P} : \forall n, k \ge 1 \ \forall a_{1..n} \in X^n \ \mu(x_{1..n} = a_{1..n}) = \mu(x_{1+k..n+k} = a_{1..n})\}, \tag{4}$$

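$$\mathcal{M}_k := \{\mu \in \mathcal{S} : \forall n \ge k \ \forall a \in X \ \forall a_{1..n} \in X^n \ \ \mu(x_{n+1} = a \mid x_{1..n} = a_{1..n}) = \mu(x_{k+1} = a \mid x_{1..k} = a_{n-k+1..n})\}. \tag{5}$$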



Abusing the notation, we will sometimes use elements of $\mathcal{D}$ and $X^\infty$ interchangeably. The following
simple statement (whose proof is obvious) will be used repeatedly in the examples.

Lemma 1. For every $\rho \in \mathcal{P}$ there exists $\mu \in \mathcal{D}$ such that $d_n(\mu, \rho) \ge n \log |X|$ for all $n \in \mathbb{N}$.
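The text leaves the proof of Lemma 1 to the reader. One way to see it, offered here as a sketch rather than as the author's argument: for a measure concentrated on a single sequence x, the divergence d_n reduces to minus the logarithm of the probability the predictor assigns to x_{1..n}, and at every step some symbol receives conditional probability at most 1/|X|. Choosing such a symbol greedily costs at least log |X| per step. The code below builds this sequence against an arbitrary predictor; the Laplace predictor and all function names are hypothetical, and logarithms are taken to base 2.

```python
from math import log2


def laplace_next(past, a, alphabet_size=2):
    """Hypothetical predictor to attack: Laplace's rule on a finite alphabet."""
    count = sum(1 for s in past if s == a)
    return (count + 1) / (len(past) + alphabet_size)


def adversarial_sequence(next_prob, alphabet, n):
    """Greedily pick a least-probable symbol at each step.

    Returns the sequence and the accumulated log-loss in bits, which equals
    d_n between the point mass on that sequence and the predictor.
    """
    past, loss = [], 0.0
    for _ in range(n):
        worst = min(alphabet, key=lambda a: next_prob(past, a))
        q = next_prob(past, worst)  # at most 1 / len(alphabet)
        loss += float("inf") if q == 0 else -log2(q)
        past.append(worst)
    return past, loss


if __name__ == "__main__":
    seq, loss = adversarial_sequence(laplace_next, [0, 1], 20)
    print(seq, loss)
```

The accumulated loss is never below n log2 |X|, whatever predictor is passed in, because the minimum of |X| probabilities that sum to one cannot exceed 1/|X|.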

3 Sequence prediction problems 

For the two notions of predictive quality introduced, we can now start stating formally the sequence prediction
problems.

Problem 1 (realizable case). Given a set of probability measures $C$, find a measure $\rho$ such that $\rho$ predicts
in total variation (expected average KL divergence) every $\mu \in C$, if such a $\rho$ exists.

Thus, Problem 1 is about finding a predictor for the case when the process generating the data is known 
to belong to a given class C. The set C here is a set of measures generating the data. Next let us formulate 
the questions about C as a set of predictors. 

Problem 2 (non-realizable case). Given a set of process measures (predictors) $C$, find a process measure $\rho$
such that $\rho$ predicts in total variation (in expected average KL divergence) every measure $\nu \in \mathcal{P}$ such that
there is $\mu \in C$ which predicts (in the same sense) $\nu$.

While Problem 2 is already quite general, it does not yet address what can be called the fully agnostic
case: if nothing at all is known about the process $\nu$ generating the data, it means that there may be no
$\mu \in C$ such that $\mu$ predicts $\nu$, and then, even if we have a solution $\rho$ to Problem 2, we still do not know
what the performance of $\rho$ on $\nu$ is going to be, compared to the performance of the predictors from $C$. To
address the fully agnostic case, we have to introduce the notion of loss.

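Definition 4. Introduce the almost sure total variation loss of $\rho$ with respect to $\mu$

$$l_{tv}(\mu, \rho) := \inf\{\alpha \in [0, 1] : \limsup_{n \to \infty} v(\mu, \rho, x_{1..n}) \le \alpha \ \ \mu\text{-a.s.}\},$$

and the asymptotic KL loss

$$l_{KL}(\nu, \rho) := \limsup_{n \to \infty} \frac{1}{n} d_n(\nu, \rho).$$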

We can now formulate the fully agnostic version of the sequence prediction problem.

Problem 3. Given a set of process measures (predictors) $C$, find a process measure $\rho$ such that $\rho$ predicts
at least as well as any $\mu$ in $C$, if any process measure $\nu \in \mathcal{P}$ is chosen to generate the data: $l(\nu, \rho) \le l(\nu, \mu)$
for every $\nu \in \mathcal{P}$ and every $\mu \in C$, where $l(\cdot, \cdot)$ is either $l_{tv}(\cdot, \cdot)$ or $l_{KL}(\cdot, \cdot)$.

The three problems just formulated represent different conceptual approaches to the sequence prediction
problem. Let us illustrate the difference by the following informal example. Suppose that the set $C$ is
that of all (ergodic, finite-state) Markov chains. Markov chains being a familiar object in probability and
statistics, we can easily construct a predictor $\rho$ that predicts every $\mu \in C$ (for example, in expected average
KL divergence, see [5]). That is, if we know that the process $\mu$ generating the data is Markovian, we know
that our predictor is going to perform well. This is the realizable case of Problem 1. In reality, rarely can
we be sure that the Markov assumption holds true for the data at hand. We may believe, however, that it is
still a reasonable assumption, in the sense that there is a Markovian model which, for our purposes (for the
purposes of prediction), is a good model of the data. Thus we may assume that there is a Markov model
(a predictor) that predicts well the process that we observe, and we would like to combine the predictive
qualities of all these Markov models. This is the "non-realizable" case of Problem 2. Note that this problem
is more difficult than the first one; in particular, a process $\nu$ generating the data may be singular with
respect to any Markov process, and still be well predicted (in the sense of expected average KL divergence,
for example) by some of them. Still, here we are making some assumptions about the process generating
the data, and if these assumptions are wrong, then we do not know anything about the performance of our
predictor. Thus we may ultimately wish to acknowledge that we do not know anything at all about the data;
we still know a lot about Markov processes, and we would like to use this knowledge on our data. If there is
anything at all Markovian in it (that is, anything that can be captured by a Markov model), then we would
like our predictor to use it. In other words, we want to have a predictor that predicts any process measure
whatsoever (at least) as well as any Markov predictor. This is the "fully agnostic" case of Problem 3.

Of course, Markov processes were just mentioned as an example, while in this work we are only concerned 
with the most general case of arbitrary unknown (uncountable) sets C of process measures. 

The following statement is rather obvious. 

Proposition 1. Any solution to Problem 3 is a solution to Problem 2, and any solution to Problem 2 is a
solution to Problem 1.

Despite the conceptual differences in formulations, it may be somewhat unclear whether the three problems
are indeed different. It turns out that this depends on the measure of predictive quality chosen: for the
case of prediction in total variation distance, all the three problems coincide, while for the case of prediction
in expected average KL divergence, they are different.

4 Prediction in Total Variation 

As was mentioned, a measure $\mu$ is absolutely continuous with respect to a measure $\rho$ if and only if $\rho$
predicts $\mu$ in total variation distance. This reduces studying at least Problem 1 for total variation distance
to studying the relation of absolute continuity. Introduce the notation $\rho \ge_{tv} \mu$ for this relation.

Let us briefly recall some facts we know about $\ge_{tv}$; details can be found, for example, in [6]. Let $[\mathcal{P}]_{tv}$
denote the set of equivalence classes of $\mathcal{P}$ with respect to $\ge_{tv}$, and for $\mu \in \mathcal{P}$ denote $[\mu]$ the equivalence
class that contains $\mu$. Two elements $\sigma_1, \sigma_2 \in [\mathcal{P}]_{tv}$ (or $\sigma_1, \sigma_2 \in \mathcal{P}$) are called disjoint (or singular) if there
is no $\nu \in [\mathcal{P}]_{tv}$ such that $\sigma_1 \ge_{tv} \nu$ and $\sigma_2 \ge_{tv} \nu$; in this case we write $\sigma_1 \perp_{tv} \sigma_2$. We write $[\mu_1] + [\mu_2]$ for
$[\frac{1}{2}(\mu_1 + \mu_2)]$. Every pair $\sigma_1, \sigma_2 \in [\mathcal{P}]_{tv}$ has a supremum $\sup(\sigma_1, \sigma_2) = \sigma_1 + \sigma_2$. Introducing into $[\mathcal{P}]_{tv}$ an
extra element $0$ such that $\sigma \ge_{tv} 0$ for all $\sigma \in [\mathcal{P}]_{tv}$, we can state that for every $\rho, \mu \in [\mathcal{P}]_{tv}$ there exists a
unique pair of elements $\mu_s$ and $\mu_a$ such that $\mu = \mu_a + \mu_s$, $\rho \ge_{tv} \mu_a$ and $\rho \perp_{tv} \mu_s$. (This is a form of Lebesgue
decomposition.) Moreover, $\mu_a = \inf(\rho, \mu)$. Thus, every pair of elements has a supremum and an infimum.
Moreover, every bounded set of disjoint elements of $[\mathcal{P}]_{tv}$ is at most countable.

Furthermore, we introduce the (unconditional) total variation distance between process measures.

Definition 5 (unconditional total variation distance). The (unconditional) total variation distance is defined
as

$$v(\mu, \rho) := \sup_{A \in \mathcal{B}} |\mu(A) - \rho(A)|.$$

Known characterizations of those sets $C$ that are bounded with respect to $\ge_{tv}$ can now be related to our
prediction problems 1-3 as follows.

Theorem 1. Let $C \subset \mathcal{P}$. The following statements about $C$ are equivalent.

(i) There exists a solution to Problem 1 in total variation.

(ii) There exists a solution to Problem 2 in total variation.

(iii) There exists a solution to Problem 3 in total variation.

(iv) $C$ is upper-bounded with respect to $\ge_{tv}$.

(v) There exists a sequence $\mu_k \in C$, $k \in \mathbb{N}$, such that for some (equivalently, for every) sequence of weights
$w_k \in (0, 1]$, $k \in \mathbb{N}$, such that $\sum_{k \in \mathbb{N}} w_k = 1$, the measure $\nu = \sum_{k \in \mathbb{N}} w_k \mu_k$ satisfies $\nu \ge_{tv} \mu$ for every
$\mu \in C$.

(vi) $C$ is separable with respect to the total variation distance.

(vii) Let $C^+ := \{\mu \in \mathcal{P} : \exists \rho \in C \ \rho \ge_{tv} \mu\}$. Every disjoint (with respect to $\ge_{tv}$) subset of $C^+$ is at most
countable.

Moreover, every solution to any of the Problems 1-3 is a solution to the other two, as is any upper bound
for $C$. The sequence $\mu_k$ in the statement (v) can be taken to be any dense (in the total variation distance)
countable subset of $C$ (cf. (vi)), or any maximal disjoint (with respect to $\ge_{tv}$) subset of $C^+$ of statement (vii),
in which every measure that is not in $C$ is replaced by any measure from $C$ that dominates it.

Proof. The implications (i) $\Leftarrow$ (ii) $\Leftarrow$ (iii) are obvious (cf. Proposition 1). The implication (iv) $\Rightarrow$ (i) is a
reformulation of the result of [1]. The converse (and hence (v) $\Rightarrow$ (iv)) was established in [4]. (i) $\Rightarrow$ (ii)
follows from the equivalence (i) $\Leftrightarrow$ (iv) and the transitivity of $\ge_{tv}$; (i) $\Rightarrow$ (iii) follows from the transitivity
of $\ge_{tv}$ and from Lemma 2 below: indeed, from Lemma 2 we have $l_{tv}(\nu, \mu) = 0$ if $\mu \ge_{tv} \nu$ and $l_{tv}(\nu, \mu) = 1$
otherwise. From this and the transitivity of $\ge_{tv}$ it follows that if $\rho \ge_{tv} \mu$ then also $l_{tv}(\nu, \rho) \le l_{tv}(\nu, \mu)$ for
all $\nu \in \mathcal{P}$. The equivalence of (v), (vi), and (i) was established in [10]. The equivalence of (iv) and (vii)
was proven in [6]. The concluding statements of the theorem are easy to demonstrate from the results cited
above. □

The following lemma is an easy consequence of [1].

Lemma 2. Let $\mu$, $\rho$ be two process measures. Then $v(\mu, \rho, x_{1..n})$ converges to either 0 or 1 with $\mu$-probability 1.

Proof. Assume that $\mu$ is not absolutely continuous with respect to $\rho$ (the other case is covered by [1]). By the
Lebesgue decomposition theorem, the measure $\mu$ admits a representation $\mu = \alpha \mu_a + (1 - \alpha) \mu_s$ where $\alpha \in [0, 1]$
and the measures $\mu_a$ and $\mu_s$ are such that $\mu_a$ is absolutely continuous with respect to $\rho$ and $\mu_s$ is singular
with respect to $\rho$. Let $W$ be such a set that $\mu_a(W) = \rho(W) = 1$ and $\mu_s(W) = 0$. Note that we can take
$\mu_a = \mu|_W$ and $\mu_s = \mu|_{X^\infty \setminus W}$. From [1] we have $v(\mu_a, \rho, x_{1..n}) \to 0$ $\mu_a$-a.s., as well as $v(\mu_a, \mu, x_{1..n}) \to 0$
$\mu_a$-a.s. and $v(\mu_s, \mu, x_{1..n}) \to 0$ $\mu_s$-a.s. Moreover, $v(\mu_s, \rho, x_{1..n}) \ge |\mu_s(W \mid x_{1..n}) - \rho(W \mid x_{1..n})| = 1$ so that
$v(\mu_s, \rho, x_{1..n}) \to 1$ $\mu_s$-a.s. Furthermore,

$$v(\mu, \rho, x_{1..n}) \le v(\mu, \mu_a, x_{1..n}) + v(\mu_a, \rho, x_{1..n}) = I$$

and

$$v(\mu, \rho, x_{1..n}) \ge -v(\mu, \mu_s, x_{1..n}) + v(\mu_s, \rho, x_{1..n}) = II.$$

We have $I \to 0$ $\mu_a$-a.s. and hence $\mu|_W$-a.s., as well as $II \to 1$ $\mu_s$-a.s. and hence $\mu|_{X^\infty \setminus W}$-a.s. Thus,

$$\mu(v(\mu, \rho, x_{1..n}) \to 0 \text{ or } 1) \ge \mu(W)\, \mu|_W(I \to 0) + \mu(X^\infty \setminus W)\, \mu|_{X^\infty \setminus W}(II \to 1) = \mu(W) + \mu(X^\infty \setminus W) = 1,$$

which concludes the proof. □

Remark. Using Lemma 2 we can also define expected (rather than almost sure) total variation loss of $\rho$
with respect to $\mu$, as the $\mu$-probability that $v(\mu, \rho)$ converges to 1:

$$l'_{tv}(\mu, \rho) := \mu\{x_1, x_2, \dots \in X^\infty : v(\mu, \rho, x_{1..n}) \to 1\}.$$

Then Problem 3 can be reformulated for this notion of loss. However, it is easy to see that for this
reformulation Theorem 1 holds true as well.

Thus, we can see that, for the case of prediction in total variation, all the sequence prediction problems 
formulated reduce to studying the relation of absolute continuity for process measures and those families 
of measures that are absolutely continuous (have a density) with respect to some measure (a predictor). 
On the one hand, from a statistical point of view such families are rather large: the assumption that the 
probabilistic law in question has a density with respect to some (nice) measure is a standard one in statistics. 
It should also be mentioned that such families can easily be uncountable. (In particular, this means that 
they are large from a computational point of view.) On the other hand, even such a basic example as the set
of all Bernoulli i.i.d. measures does not allow for a predictor that predicts every measure in total variation
(as explained in Section 2).

That is why we have to consider weaker notions of predictions; from these, prediction in expected average 
KL divergence is perhaps one of the weakest. The goal of the next sections is to see which of the properties 
that we have for total variation can be transferred (and in which sense) to the case of expected average KL 
divergence. 

5 Prediction in Expected Average KL Divergence 

First of all, we have to observe that for prediction in KL divergence Problems 1, 2, and 3 are different, as the
following theorem shows. While the examples provided in the proof are artificial, there is a very important
example illustrating the difference between Problem 1 and Problem 3 for expected average KL divergence:
the set $\mathcal{S}$ of all stationary processes, given in Theorem 6 at the end of this section.

Theorem 2. For the case of prediction in expected average KL divergence, Problems 1, 2 and 3 are different:
there exists a set $C_1 \subset \mathcal{P}$ for which there is a solution to Problem 1 but there is no solution to Problem 2,
and there is a set $C_2 \subset \mathcal{P}$ for which there is a solution to Problem 2 but there is no solution to Problem 3.

Proof. We have to provide two examples. Fix the binary alphabet $X = \{0, 1\}$. For each deterministic
sequence $t = t_1, t_2, \dots \in X^\infty$ construct the process measure $\gamma_t$ as follows: $\gamma_t(x_n = t_n \mid t_{1..n-1}) := 1 - \frac{1}{n+1}$
and for $x_{1..n-1} \ne t_{1..n-1}$ let $\gamma_t(x_n = 0 \mid x_{1..n-1}) = 1/2$, for all $n \in \mathbb{N}$. That is, $\gamma_t$ is the Bernoulli i.i.d. 1/2
process measure strongly biased towards a specific deterministic sequence, $t$. Let also $\gamma(x_{1..n}) = 2^{-n}$ for all
$x_{1..n} \in X^n$, $n \in \mathbb{N}$ (the Bernoulli i.i.d. 1/2). For the set $C_1 := \{\gamma_t : t \in X^\infty\}$ we have a solution to Problem
1: indeed, $d_n(\gamma_t, \gamma) \le 1 = o(n)$. However, there is no solution to Problem 2. Indeed, for each $t \in \mathcal{D}$ we have
$d_n(t, \gamma_t) = \log n = o(n)$ (that is, for every deterministic measure there is an element of $C_1$ which predicts it),
while by Lemma 1 for every $\rho \in \mathcal{P}$ there exists $t \in \mathcal{D}$ such that $d_n(t, \rho) \ge n$ for all $n \in \mathbb{N}$ (that is, there is
no predictor which predicts every measure that is predicted by at least one element of $C_1$).

The second example is similar. For each deterministic sequence $t = t_1, t_2, \dots \in \mathcal{D}$ construct the process
measure $\gamma'_t$ as follows: $\gamma'_t(x_n = t_n \mid t_{1..n-1}) := 2/3$ and for $x_{1..n-1} \ne t_{1..n-1}$ let $\gamma'_t(x_n = 0 \mid x_{1..n-1}) = 1/2$, for
all $n \in \mathbb{N}$. It is easy to see that $\gamma$ is a solution to Problem 2 for the set $C_2 := \{\gamma'_t : t \in X^\infty\}$. Indeed, if
$\nu \in \mathcal{P}$ is such that $d_n(\nu, \gamma'_t) = o(n)$ then we must have $\nu(t_{1..n}) = o(1)$. From this and the fact that $\gamma$ and $\gamma'_t$
coincide (up to $O(1)$) on all other sequences we conclude $d_n(\nu, \gamma) = o(n)$. However, there is no solution to
Problem 3 for $C_2$. Indeed, for every $t \in \mathcal{D}$ we have $d_n(t, \gamma'_t) = n \log 3/2 + o(n)$. Therefore, if $\rho$ is a solution
to Problem 3 then $\limsup \frac{1}{n} d_n(t, \rho) \le \log 3/2 < 1$, which contradicts Lemma 1. □
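The two growth rates used in the proof can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not spelled out in the text: that the first family assigns conditional probability 1 - 1/(n+1) to the n-th symbol of t (a reading of the formula above), and that logarithms are to base 2, which is what makes the bound log 3/2 < 1 comparable with Lemma 1 for a binary alphabet. For a deterministic t, the divergence between t and a measure is minus the logarithm of the probability that measure gives to the prefix of t. The function names are hypothetical.

```python
from math import isclose, log2


def loss_on_own_sequence(cond_prob, n):
    """d_n(t, gamma) in bits for a deterministic t, where gamma gives
    conditional probability cond_prob(k) to the correct k-th symbol of t."""
    return -sum(log2(cond_prob(k)) for k in range(1, n + 1))


def first_family(k):
    """Assumed form of the first example: 1 - 1/(k+1) along t."""
    return 1 - 1 / (k + 1)


def second_family(k):
    """Second example: constant 2/3 along t."""
    return 2 / 3


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n,
              loss_on_own_sequence(first_family, n) / n,   # vanishes with n
              loss_on_own_sequence(second_family, n) / n)  # stays at log2(3/2)
```

Under these assumptions the first quantity equals log2(n+1), so its average vanishes, while the second equals n log2(3/2), whose average stays strictly below the one bit per symbol that Lemma 1 forces on any single predictor for some deterministic sequence.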

Thus, prediction in expected average KL divergence turns out to be a more complicated matter than
prediction in total variation. The next idea is to try and see which of the facts about prediction in total variation
can be generalized to some of the problems concerning prediction in expected average KL divergence.

First, observe that, for the case of prediction in total variation, the equivalence of Problems 1 and 2 was
derived from the transitivity of the relation $\ge_{tv}$ of absolute continuity. For the case of expected average KL
divergence, the relation "$\rho$ predicts $\mu$ in expected average KL divergence" is not transitive (and Problems 1
and 2 are not equivalent). However, for Problem 2 we are interested in the following relation: $\rho$ "dominates"
$\mu$ if $\rho$ predicts every $\nu$ such that $\mu$ predicts $\nu$. Denote this relation by $\ge_{KL}$:



Definition 6 ($\ge_{KL}$). We write $\rho \ge_{KL} \mu$ if for every $\nu \in \mathcal{P}$ the equality $\limsup \frac{1}{n} d_n(\nu, \mu) = 0$ implies
$\limsup \frac{1}{n} d_n(\nu, \rho) = 0$.



The relation $\ge_{KL}$ has some similarities with $\ge_{tv}$. First of all, $\ge_{KL}$ is also transitive (as can be easily
seen from the definition). Moreover, similarly to $\ge_{tv}$, one can show that for any $\mu$, $\rho$ any strictly convex
combination $\alpha \mu + (1 - \alpha) \rho$ is a supremum of $\{\rho, \mu\}$ with respect to $\ge_{KL}$. Next we will obtain a characterization
of predictability with respect to $\ge_{KL}$ similar to one of those obtained for $\ge_{tv}$.

The key observation is the following. If there is a solution to Problem 2 for a set $C$ then a solution can
be obtained as a Bayesian mixture over a countable subset of $C$. For total variation this is the statement (v)
of Theorem 1.

Theorem 3. Let $C$ be a set of probability measures on $\Omega$. If there is a measure $\rho$ such that $\rho \ge_{KL} \mu$ for every
$\mu \in C$ ($\rho$ is a solution to Problem 2) then there is a sequence $\mu_k \in C$, $k \in \mathbb{N}$, such that $\sum_{k \in \mathbb{N}} w_k \mu_k \ge_{KL} \mu$
for every $\mu \in C$, where $w_k$ are some positive weights.

The proof is deferred to the Appendix. An analogous result for Problem 1 was established in [9]. (The proof
of Theorem 3 is based on similar ideas, but is more involved.)

For the case of Problem 3, we do not have results similar to Theorem 3 (or statement (v) of Theorem 1); 
in fact, we conjecture that the opposite is true: there exists a (measurable) set C of measures such that there 
is a solution to Problem 3 for C, but there is no Bayesian solution to Problem 3, meaning that there is no 
probability distribution on C (discrete or not) such that the mixture over C with respect to this distribution 
is a solution to Problem 3 for C. 

However, we can take a different route and extend another part of Theorem 1 to obtain a characterization 
of sets C for which a solution to Problem 3 exists. 

We have seen that, in the case of prediction in total variation, separability with respect to the topology 
of this distance is a necessary and sufficient condition for the existence of a solution to Problems 1-3. In the 
case of expected average KL divergence the situation is somewhat different, since, first of all, (asymptotic 
average) KL divergence is not a metric. While one can introduce a topology based on it, separability with 
respect to this topology turns out to be a sufficient but not a necessary condition for the existence of a 
predictor, as is shown in the next theorem. 

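Definition 7. Define the distance $d_\infty(\mu_1, \mu_2)$ on process measures as follows

$$d_\infty(\mu_1, \mu_2) = \limsup_{n \to \infty} \sup_{x_{1..n} \in X^n} \frac{1}{n} \left| \log \frac{\mu_1(x_{1..n})}{\mu_2(x_{1..n})} \right|, \tag{6}$$

where we assume $\log 0/0 := 0$.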

Clearly, $d_\infty$ is symmetric and satisfies the triangle inequality, but it is not exact. Moreover, for every
$\mu_1, \mu_2$ we have

$$\limsup_{n\to\infty} \frac{1}{n} d_n(\mu_1,\mu_2) \le d_\infty(\mu_1,\mu_2). \tag{7}$$

The distance $d_\infty(\mu_1,\mu_2)$ measures the difference in behaviour of $\mu_1$ and $\mu_2$ on all individual sequences. Thus,
using this distance to analyse Problem 3 is closest to the traditional approach to the non-realizable case,
which is formulated in terms of predicting individual deterministic sequences.
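As a small numerical illustration of (6) and (7), the following sketch compares the worst-case and the expected log-ratio for a pair of i.i.d. Bernoulli measures, by brute force over all binary words of length n. The Bernoulli family, the parameter values 0.3 and 0.6, and all function names are assumptions made here for illustration; they do not come from the text. Logarithms are taken to base 2.

```python
import itertools
import math


def bernoulli_prob(p, word):
    """Probability of a binary word under the i.i.d. measure with P(1) = p."""
    ones = sum(word)
    return p ** ones * (1 - p) ** (len(word) - ones)


def sup_log_ratio_rate(p, q, n):
    """(1/n) * sup over words of length n of |log2 mu1(x)/mu2(x)|, cf. (6)."""
    return max(
        abs(math.log2(bernoulli_prob(p, w) / bernoulli_prob(q, w)))
        for w in itertools.product((0, 1), repeat=n)
    ) / n


def kl_rate(p, q, n):
    """(1/n) * d_n(mu1, mu2): expected log2-ratio under mu1, cf. (7)."""
    total = 0.0
    for w in itertools.product((0, 1), repeat=n):
        pw = bernoulli_prob(p, w)
        total += pw * math.log2(pw / bernoulli_prob(q, w))
    return total / n


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(n, kl_rate(0.3, 0.6, n), sup_log_ratio_rate(0.3, 0.6, n))
```

For i.i.d. measures neither rate depends on n, so the limits in (6) and (7) are trivial in this toy case; the point is only that the expected log-ratio never exceeds the worst-case one.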

Theorem 4. (i) Let $C$ be a set of process measures. If $C$ is separable with respect to $d_\infty$ then there is a
solution to Problem 3 for $C$, for the case of prediction in expected average KL divergence.

(ii) There exists a set of process measures $C$ such that $C$ is not separable with respect to $d_\infty$, but there is a
solution to Problem 3 for this set, for the case of prediction in expected average KL divergence.

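Proof. For the first statement, let $C$ be separable and let $(\mu_k)_{k\in\mathbb{N}}$ be a dense countable subset of $C$. Define
$\nu := \sum_{k\in\mathbb{N}} w_k\mu_k$, where $w_k$ are any positive summable weights. Fix any measure $\tau$ and any $\mu \in C$. We will
show that $\limsup_{n\to\infty} \frac{1}{n} d_n(\tau,\nu) \le \limsup_{n\to\infty} \frac{1}{n} d_n(\tau,\mu)$. For every $\varepsilon$, find a $k \in \mathbb{N}$ such that $d_\infty(\mu,\mu_k) \le \varepsilon$.
We have

$$\begin{aligned}
d_n(\tau,\nu) &\le d_n(\tau, w_k\mu_k) = E_\tau \log\frac{\tau(x_{1..n})}{\mu_k(x_{1..n})} - \log w_k \\
&= E_\tau \log\frac{\tau(x_{1..n})}{\mu(x_{1..n})} + E_\tau \log\frac{\mu(x_{1..n})}{\mu_k(x_{1..n})} - \log w_k \\
&\le d_n(\tau,\mu) + \sup_{x_{1..n}\in\mathcal{X}^n}\left|\log\frac{\mu(x_{1..n})}{\mu_k(x_{1..n})}\right| - \log w_k.
\end{aligned}$$

From this, dividing by $n$ and taking $\limsup_{n\to\infty}$ on both sides, we conclude

$$\limsup_{n\to\infty} \frac{1}{n} d_n(\tau,\nu) \le \limsup_{n\to\infty} \frac{1}{n} d_n(\tau,\mu) + \varepsilon.$$

Since this holds for every $\varepsilon > 0$, the first statement is proven.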

The second statement is proven by the following example. Let $C$ be the set of all deterministic sequences
(measures concentrated on just one sequence) such that the number of 0s in the first $n$ symbols is less than
$\sqrt{n}$, for all $n \in \mathbb{N}$. Clearly, this set is uncountable. It is easy to check that $\mu_1 \ne \mu_2$ implies $d_\infty(\mu_1,\mu_2) = \infty$
for every $\mu_1,\mu_2 \in C$, but the predictor $\nu$, given by $\nu(x_n = 0) = 1/n$ independently for different $n$, predicts
every $\mu \in C$ in expected average KL divergence. Since all elements of $C$ are deterministic, $\nu$ is also a solution
to Problem 3 for $C$. □
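The behaviour of such a predictor on one sequence from this class can be checked numerically. The sketch below makes several assumptions that are not in the text: the bound on the number of 0s is taken to be the square root of n, the test sequence has its 0s exactly at the positions 4, 9, 16, ..., and the predictor assigns probability 1/(n+1) rather than 1/n to a 0 at time n, so that the first symbol never receives probability zero. For a deterministic sequence, the divergence d_n reduces to the cumulative log loss of the predictor on that sequence (base-2 logarithms assumed).

```python
import math


def zeros_at_squares(n):
    """First n symbols of a sequence with 0 at positions 4, 9, 16, ... and 1 elsewhere."""
    squares = {k * k for k in range(2, math.isqrt(n) + 1)}
    return [0 if i in squares else 1 for i in range(1, n + 1)]


def cumulative_log_loss(seq):
    """-log2 of the probability that the predictor assigns to the whole sequence."""
    loss = 0.0
    for i, symbol in enumerate(seq, start=1):
        p_zero = 1.0 / (i + 1)  # assumed variant of the 1/n predictor
        p = p_zero if symbol == 0 else 1.0 - p_zero
        loss -= math.log2(p)
    return loss


def average_log_loss(n):
    """Per-symbol log loss on the first n symbols of the test sequence."""
    return cumulative_log_loss(zeros_at_squares(n)) / n


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, average_log_loss(n))
```

The printed averages decrease with n: each 0 costs roughly log n bits, and there are fewer than the square root of n of them, while all the 1s together cost only about log n bits.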

Although simple, Theorem 4 can be used to establish the existence of a solution to Problem 3 for an 
important class of process measures: that of all processes with finite memory, as the next theorem shows. 
Results similar to Theorem 5 are known in different settings, e.g., [15, 7, 2] and others. 

Theorem 5. There exists a solution to Problem 3 for prediction in expected average KL divergence for the
set of all finite-memory process measures $\mathcal{M} := \cup_{k\in\mathbb{N}}\mathcal{M}_k$.

Proof. We will show that the set $\mathcal{M}$ is separable with respect to $d_\infty$. Then the statement will follow from
Theorem 4. It is enough to show that each set $\mathcal{M}_k$ is separable with respect to $d_\infty$.

For simplicity, assume that the alphabet is binary ($|\mathcal{X}| = 2$; the general case is analogous). Observe
that the family $\mathcal{M}_k$ of $k$-order stationary binary-valued Markov processes is parametrized by $|\mathcal{X}|^k$ $[0,1]$-valued
parameters: the probability of observing 0 after observing $x_{1..k}$, for each $x_{1..k} \in \mathcal{X}^k$. Note that this
parametrization is continuous (as a mapping from the parameter space with the Euclidean topology to $\mathcal{M}_k$
with the topology of $d_\infty$). Indeed, for any $\mu_1,\mu_2 \in \mathcal{M}_k$ and every $x_{1..n} \in \mathcal{X}^n$ such that $\mu_i(x_{1..n}) \ne 0$,
$i = 1, 2$, it is easy to see that

$$\frac{1}{n}\left|\log\frac{\mu_1(x_{1..n})}{\mu_2(x_{1..n})}\right| \le \sup_{x_{1..k+1}} \frac{1}{k+1}\left|\log\frac{\mu_1(x_{1..k+1})}{\mu_2(x_{1..k+1})}\right|, \tag{8}$$

so that the right-hand side of (8) also upper-bounds $d_\infty(\mu_1,\mu_2)$, implying continuity of the parametrization.

It follows that the set $\mu^k_q$, $q \in Q^{|\mathcal{X}|^k}$, of all stationary $k$-order Markov processes with rational values of all
the parameters ($Q := \mathbb{Q} \cap [0,1]$) is dense in $\mathcal{M}_k$, proving the separability of the latter set. □
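The density argument can be illustrated numerically: rounding the transition parameters of a Markov chain to rationals with a growing denominator drives the per-symbol log-ratio of the conditional probabilities to zero. The sketch below is an illustration under stated assumptions, not the construction used in the proof: the parameter values, the function names and the restriction to parameters strictly inside (0, 1) are all hypothetical, and the contribution of the initial k symbols is ignored. Base-2 logarithms are assumed.

```python
import math
from fractions import Fraction


def rational_approx(theta, denom):
    """Nearest fraction with the given denominator, kept strictly inside (0, 1)."""
    num = min(max(round(theta * denom), 1), denom - 1)
    return Fraction(num, denom)


def per_symbol_log_ratio(thetas, qs):
    """Largest |log2| ratio between matching conditional probabilities.

    thetas[c] and qs[c] are the probabilities of emitting 0 after context c;
    the probabilities of emitting 1 are the complements.
    """
    worst = 0.0
    for theta, q in zip(thetas, qs):
        q = float(q)
        worst = max(
            worst,
            abs(math.log2(theta / q)),
            abs(math.log2((1 - theta) / (1 - q))),
        )
    return worst


if __name__ == "__main__":
    # Hypothetical order-1 binary chain: one parameter per context (0 and 1).
    thetas = [0.3141, 0.7182]
    for denom in (10, 100, 1000):
        qs = [rational_approx(t, denom) for t in thetas]
        print(denom, qs, per_symbol_log_ratio(thetas, qs))
```

Each further symbol multiplies the ratio of the two word probabilities by a ratio of conditional probabilities, so the printed quantity bounds how fast the log-ratio can grow per symbol after the first k symbols.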



Another important example is the set of all stationary process measures S. This example also illustrates 
the difference between the prediction problems that we consider. For this set a solution to Problem 1 was 
given in [8]. In contrast, here we show that there is no solution to Problem 3 for S. 

Theorem 6. There is no solution to Problem 3 for the set of all stationary processes S.

Proof. This proof is based on a construction similar to the one used in [8] to demonstrate the impossibility of
consistent prediction of stationary processes without Cesaro averaging.

Let $m$ be a Markov chain with states 0, 1, 2, ... and state transitions defined as follows. From each state
$k \in \mathbb{N} \cup \{0\}$ the chain passes to the state $k+1$ with probability 2/3 and to the state 0 with probability
1/3. It is easy to see that this chain possesses a unique stationary distribution on the set of states (see, e.g.,
[13]); taken as the initial distribution it defines a stationary ergodic process with values in $\mathbb{N} \cup \{0\}$. Fix the
ternary alphabet $\mathcal{X} = \{a, 0, 1\}$. For each sequence $t = t_1, t_2, \dots \in \{0,1\}^\infty$ define the process $\mu_t$ as follows. It
is a deterministic function of the chain $m$. If the chain is in the state 0 then the process $\mu_t$ outputs $a$; if the
chain $m$ is in the state $k > 0$ then the process outputs $t_k$. That is, we have defined a hidden Markov process
which in the state 0 of the underlying Markov chain always outputs $a$, while in other states it outputs either
0 or 1 according to the sequence $t$.

To show that there is no solution to Problem 3 for S, we will show that there is no solution to Problem 3
for the smaller set $C := \{\mu_t : t \in \{0,1\}^\infty\}$. Indeed, for any $t \in \{0,1\}^\infty$ we have $d_n(t,\mu_t) = n\log 3/2 + o(n)$.
Then if $\rho$ is a solution to Problem 3 for $C$ we should have $\limsup_{n\to\infty} \frac{1}{n} d_n(t,\rho) \le \log 3/2 < 1$ for every
$t \in \mathcal{D}$, which contradicts Lemma 1. □
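The chain used in this proof is simple enough to check by hand or by machine. Solving the balance equations for the stated transition probabilities gives the stationary distribution pi_k = (1/3)(2/3)^k; this closed form is derived here and is not stated in the text. The sketch below verifies it on a truncated state space and also evaluates the per-symbol cost log 3/2 that appears in the proof (base-2 logarithms assumed; function names are hypothetical).

```python
import math


def stationary_distribution(num_states):
    """pi_k = (1/3) * (2/3)**k for k = 0, ..., num_states - 1 (truncated)."""
    return [(1 / 3) * (2 / 3) ** k for k in range(num_states)]


def balance_residual(pi):
    """Largest violation of pi = pi P on the truncated state space."""
    # State 0 is entered from every state with probability 1/3.
    residual = abs(pi[0] - sum(pi) / 3)
    # State k + 1 is entered only from state k, with probability 2/3.
    for k in range(len(pi) - 1):
        residual = max(residual, abs(pi[k + 1] - (2 / 3) * pi[k]))
    return residual


if __name__ == "__main__":
    pi = stationary_distribution(100)
    print("total mass:", sum(pi))
    print("balance residual:", balance_residual(pi))
    # While the chain keeps climbing, each symbol of t has probability 2/3.
    print("per-symbol cost log2(3/2):", math.log2(3 / 2))
```

The residual is nonzero only because of the truncation to finitely many states, and it vanishes geometrically as the number of states grows.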

From the proof of Theorem 6 one can see that, in fact, the statement that is proven is stronger: there is
no solution to Problem 3 for the set of all functions of stationary ergodic countable-state Markov chains.
We conjecture that a solution to Problem 2 exists for the latter set, but not for the set of all stationary
processes.

6 Discussion 

It has been long realized that the so-called probabilistic and agnostic (adversarial, non-stochastic, deter- 
ministic) settings of the problem of sequential prediction are strongly related. This has been most evident 
from looking at the solutions to these problems, which are usually based on the same ideas. Here we have 
proposed a formulation of the agnostic problem as a non-realizable case of the probabilistic problem. While 
being very close to the traditional one, this setting allows us to directly compare the two problems. As a 
somewhat surprising result, we can see that whether the two problems are different depends on the measure 
of performance chosen: in the case of prediction in total variation distance they coincide, while in the case of 
prediction in expected average KL divergence they are different. In the latter case, the distinction becomes 
particularly apparent on the example of stationary processes: while a solution to the realizable problem has 
long been known, here we have shown that there is no solution to the agnostic version of this problem. The 
new formalization also allowed us to introduce another problem that lies in between the realizable and the 
fully agnostic problems: given a class of process measures C, find a predictor that predicts asymptotically
optimally every measure for which at least one of the measures in C is asymptotically optimal (Problem 2).
This problem is less restrictive than the fully agnostic one (in particular, it is not concerned with the behaviour
of a predictor on every deterministic sequence) but at the same time the solutions to this problem
have performance guarantees far outside the model class considered. 

Since the problem formulations presented here are mostly new (at least, in such a general form), it is 
not surprising that there are many questions left open. A promising route to obtain new results seems to be 
to first analyse the case of prediction in total variation, which amounts to studying the relation of absolute 
continuity and singularity of probability measures, and then to try and find analogues in less restrictive 
(and thus more interesting and difficult) cases of predicting only the next observation, possibly with Cesaro 
averaging. This is the approach that we took in this work. Here it is interesting to find properties common 
to all or most of the prediction problems (in total variation as well as with respect to other measures of
performance). A candidate is the "countable Bayes" property of Theorem 3: if there is a solution to a
given sequence prediction problem for a set C, then a solution can be obtained as a mixture over a suitable
countable subset of C.

Another direction for future research concerns finite-time performance analysis. In this work we have
adopted the asymptotic approach to the prediction problem, ignoring the behaviour of predictors before
the asymptotic regime is reached. While for prediction in total variation this is a natural choice, for other measures of performance,
including average KL divergence, it is clear that Problems 1-3 admit non-asymptotic formulations. It would also
be interesting to understand the relations between the performance guarantees that can be obtained in non-asymptotic
formulations of Problems 1-3.

Appendix: Proof of Theorem 3 

Proof. Define the weights $w_k := w k^{-2}$, where $w$ is the normalizer $6/\pi^2$. Define the sets $C_\mu$ as the set of all
measures $\tau \in \mathcal{P}$ such that $\mu$ predicts $\tau$ in expected average KL divergence. Let $C^+ := \cup_{\mu\in C} C_\mu$. For each
$\tau \in C^+$ let $p(\tau)$ be any (fixed) $\mu \in C$ such that $\tau \in C_\mu$. In other words, $C^+$ is the set of all measures that are
predicted by some of the measures in $C$, and for each measure $\tau$ in $C^+$ we designate one "parent" measure
$p(\tau)$ from $C$ such that $p(\tau)$ predicts $\tau$.

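Step 1. For each $\mu \in C^+$ let $\delta_n$ be any monotonically increasing function such that $\delta_n(\mu) = o(n)$ and
$d_n(\mu, p(\mu)) = o(\delta_n(\mu))$. Define the sets

$$U^n_\mu := \left\{x_{1..n} \in \mathcal{X}^n : \mu(x_{1..n}) \ge \frac{1}{n}\rho(x_{1..n})\right\}, \tag{9}$$

$$V^n_\mu := \left\{x_{1..n} \in \mathcal{X}^n : p(\mu)(x_{1..n}) \ge 2^{-\delta_n(\mu)}\mu(x_{1..n})\right\}, \tag{10}$$

and

$$T^n_\mu := U^n_\mu \cap V^n_\mu. \tag{11}$$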

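We will upper-bound $\mu(\mathcal{X}^n \setminus T^n_\mu)$. First, using Markov's inequality, we derive

$$\mu(\mathcal{X}^n \setminus U^n_\mu) = \mu\left(\frac{\rho(x_{1..n})}{\mu(x_{1..n})} > n\right) \le \frac{1}{n} E_\mu \frac{\rho(x_{1..n})}{\mu(x_{1..n})} = \frac{1}{n}. \tag{12}$$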



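Next, observe that for every $n \in \mathbb{N}$ and every set $A \subset \mathcal{X}^n$, using Jensen's inequality we can obtain

$$-\sum_{x_{1..n}\in A} \mu(x_{1..n}) \log\frac{\rho(x_{1..n})}{\mu(x_{1..n})} = -\mu(A) \sum_{x_{1..n}\in A} \frac{\mu(x_{1..n})}{\mu(A)} \log\frac{\rho(x_{1..n})}{\mu(x_{1..n})} \ge -\mu(A)\log\frac{\rho(A)}{\mu(A)} \ge -\mu(A)\log\rho(A) - \frac{1}{2}. \tag{13}$$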



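Moreover,

$$d_n(\mu, p(\mu)) = -\sum_{x_{1..n}\in\mathcal{X}^n\setminus V^n_\mu} \mu(x_{1..n}) \log\frac{p(\mu)(x_{1..n})}{\mu(x_{1..n})} - \sum_{x_{1..n}\in V^n_\mu} \mu(x_{1..n}) \log\frac{p(\mu)(x_{1..n})}{\mu(x_{1..n})} \ge \delta_n(\mu)\,\mu(\mathcal{X}^n \setminus V^n_\mu) - 1/2,$$

where in the inequality we have used (10) for the first summand and (13) for the second. Thus,

$$\mu(\mathcal{X}^n \setminus V^n_\mu) \le \frac{d_n(\mu, p(\mu)) + 1/2}{\delta_n(\mu)} = o(1). \tag{14}$$

From (11), (12) and (14) we conclude

$$\mu(\mathcal{X}^n \setminus T^n_\mu) \le \mu(\mathcal{X}^n \setminus V^n_\mu) + \mu(\mathcal{X}^n \setminus U^n_\mu) = o(1). \tag{15}$$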

Step 2n: a countable cover, time n. Fix an $n \in \mathbb{N}$. Define $m^n_1 := \max_{\mu\in C^+} \rho(T^n_\mu)$ (since $\mathcal{X}^n$ are finite
all suprema are reached). Find any $\mu^n_1$ such that $\rho(T^n_{\mu^n_1}) = m^n_1$ and let $T^n_1 := T^n_{\mu^n_1}$. For $k > 1$, let
$m^n_k := \max_{\mu\in C^+} \rho(T^n_\mu \setminus T^n_{k-1})$. If $m^n_k > 0$, let $\mu^n_k$ be any $\mu \in C^+$ such that $\rho(T^n_{\mu^n_k} \setminus T^n_{k-1}) = m^n_k$, and let
$T^n_k := T^n_{k-1} \cup T^n_{\mu^n_k}$; otherwise let $T^n_k := T^n_{k-1}$. Observe that (for each $n$) there is only a finite number of
positive $m^n_k$, since the set $\mathcal{X}^n$ is finite; let $K_n$ be the largest index $k$ such that $m^n_k > 0$. Let

$$\nu_n := \sum_{k=1}^{K_n} w_k\, p(\mu^n_k). \tag{16}$$

As a result of this construction, for every $n \in \mathbb{N}$, every $k \le K_n$ and every $x_{1..n} \in T^n_k$, using the definitions (11),
(9) and (10) we obtain

$$\nu_n(x_{1..n}) \ge w_k \frac{1}{n} 2^{-\delta_n(\mu)} \rho(x_{1..n}). \tag{17}$$

n 

Step 2: the resulting predictor. Finally, define

$$\nu := \frac{1}{2}\gamma + \frac{1}{2}\sum_{n\in\mathbb{N}} w_n \nu_n, \tag{18}$$

where $\gamma$ is the i.i.d. measure with equal probabilities of all $x \in \mathcal{X}$ (that is, $\gamma(x_{1..n}) = |\mathcal{X}|^{-n}$ for every $n \in \mathbb{N}$
and every $x_{1..n} \in \mathcal{X}^n$). We will show that $\nu$ predicts every $\mu \in C^+$, and then at the end of the proof (Step r)
we will show how to replace $\gamma$ by a combination of a countable set of elements of $C$ (in fact, $\gamma$ is just a
regularizer which ensures that the $\nu$-probability of any word is never too close to 0).

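Step 3: $\nu$ predicts every $\mu \in C^+$. Fix any $\mu \in C^+$. Introduce the parameters $\varepsilon^n_\mu \in (0,1)$, $n \in \mathbb{N}$, to be
defined later, and let $j^n_\mu := 1/\varepsilon^n_\mu$. Observe that $\rho(T^n_k \setminus T^n_{k-1}) \ge \rho(T^n_{k+1} \setminus T^n_k)$, for any $k > 1$ and any $n \in \mathbb{N}$,
by definition of these sets. Since the sets $T^n_k \setminus T^n_{k-1}$, $k \in \mathbb{N}$, are disjoint, we obtain $\rho(T^n_k \setminus T^n_{k-1}) \le 1/k$. Hence,
$\rho(T^n_\mu \setminus T^n_j) \le \varepsilon^n_\mu$ for some $j \le j^n_\mu$, since otherwise $m^n_j = \max_{\mu\in C^+} \rho(T^n_\mu \setminus T^n_{j^n_\mu}) > \varepsilon^n_\mu$ so that $\rho(T^n_{j^n_\mu+1} \setminus T^n_{j^n_\mu}) >
\varepsilon^n_\mu = 1/j^n_\mu$, which is a contradiction. Thus,

$$\rho(T^n_\mu \setminus T^n_{j^n_\mu}) \le \varepsilon^n_\mu. \tag{19}$$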

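We can upper-bound $\mu(T^n_\mu \setminus T^n_{j^n_\mu})$ as follows. First, observe that

$$\begin{aligned}
d_n(\mu,\rho) = &-\sum_{x_{1..n}\in T^n_\mu \cap T^n_{j^n_\mu}} \mu(x_{1..n}) \log\frac{\rho(x_{1..n})}{\mu(x_{1..n})} \\
&-\sum_{x_{1..n}\in T^n_\mu \setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log\frac{\rho(x_{1..n})}{\mu(x_{1..n})} \\
&-\sum_{x_{1..n}\in \mathcal{X}^n \setminus T^n_\mu} \mu(x_{1..n}) \log\frac{\rho(x_{1..n})}{\mu(x_{1..n})} = I + II + III.
\end{aligned} \tag{20}$$

Then, from (11) and (9) we get

$$I \ge -\log n. \tag{21}$$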



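From (13) and (19) we get

$$II \ge -\mu(T^n_\mu \setminus T^n_{j^n_\mu}) \log\rho(T^n_\mu \setminus T^n_{j^n_\mu}) - 1/2 \ge -\mu(T^n_\mu \setminus T^n_{j^n_\mu}) \log\varepsilon^n_\mu - 1/2. \tag{22}$$

Furthermore,

$$III \ge \sum_{x_{1..n}\in\mathcal{X}^n\setminus T^n_\mu} \mu(x_{1..n}) \log\mu(x_{1..n}) \ge \mu(\mathcal{X}^n \setminus T^n_\mu) \log\frac{\mu(\mathcal{X}^n \setminus T^n_\mu)}{|\mathcal{X}^n \setminus T^n_\mu|} \ge -\frac{1}{2} - \mu(\mathcal{X}^n \setminus T^n_\mu)\, n \log|\mathcal{X}|, \tag{23}$$

where the first inequality is obvious, in the second inequality we have used the fact that entropy is maximized
when all events are equiprobable and in the third one we used $|\mathcal{X}^n \setminus T^n_\mu| \le |\mathcal{X}|^n$. Combining (20) with the
bounds (21), (22) and (23) we obtain

$$d_n(\mu,\rho) \ge -\log n - \mu(T^n_\mu \setminus T^n_{j^n_\mu}) \log\varepsilon^n_\mu - 1 - \mu(\mathcal{X}^n \setminus T^n_\mu)\, n \log|\mathcal{X}|,$$

so that

$$\mu(T^n_\mu \setminus T^n_{j^n_\mu}) \le \frac{1}{-\log\varepsilon^n_\mu}\left(d_n(\mu,\rho) + \log n + 1 + \mu(\mathcal{X}^n \setminus T^n_\mu)\, n \log|\mathcal{X}|\right). \tag{24}$$

From the fact that $d_n(\mu,\rho) = o(n)$ and (15) it follows that the term in brackets is $o(n)$, so that we can
define the parameters $\varepsilon^n_\mu$ in such a way that $-\log\varepsilon^n_\mu = o(n)$ while at the same time the bound (24) gives
$\mu(T^n_\mu \setminus T^n_{j^n_\mu}) = o(1)$. Fix such a choice of $\varepsilon^n_\mu$. Then, using (15), we conclude

$$\mu(\mathcal{X}^n \setminus T^n_{j^n_\mu}) \le \mu(\mathcal{X}^n \setminus T^n_\mu) + \mu(T^n_\mu \setminus T^n_{j^n_\mu}) = o(1). \tag{25}$$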

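We proceed with the proof of $d_n(\mu,\nu) = o(n)$. For any $x_{1..n} \in T^n_{j^n_\mu}$ we have

$$\nu(x_{1..n}) \ge \frac{1}{2} w_n \nu_n(x_{1..n}) \ge \frac{1}{2} w_n w_{j^n_\mu} \frac{1}{n} 2^{-\delta_n(\mu)} \rho(x_{1..n}) = \frac{w_n w}{2n} (\varepsilon^n_\mu)^2\, 2^{-\delta_n(\mu)} \rho(x_{1..n}), \tag{26}$$

where the first inequality follows from (18), the second from (17), and in the equality we have used $w_{j^n_\mu} =
w/(j^n_\mu)^2$ and $j^n_\mu = 1/\varepsilon^n_\mu$. Next we use the decomposition

$$d_n(\mu,\nu) = -\sum_{x_{1..n}\in T^n_{j^n_\mu}} \mu(x_{1..n}) \log\frac{\nu(x_{1..n})}{\mu(x_{1..n})} - \sum_{x_{1..n}\in\mathcal{X}^n\setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log\frac{\nu(x_{1..n})}{\mu(x_{1..n})} = I + II. \tag{27}$$

From (26) we find

$$\begin{aligned}
I &\le -\log\left(\frac{w_n w}{2n} (\varepsilon^n_\mu)^2\, 2^{-\delta_n(\mu)}\right) - \sum_{x_{1..n}\in T^n_{j^n_\mu}} \mu(x_{1..n}) \log\frac{\rho(x_{1..n})}{\mu(x_{1..n})} \\
&= \left(1 + 3\log n - 2\log\varepsilon^n_\mu - 2\log w + \delta_n(\mu)\right) + \left(d_n(\mu,\rho) + \sum_{x_{1..n}\in\mathcal{X}^n\setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log\frac{\rho(x_{1..n})}{\mu(x_{1..n})}\right) \\
&\le o(n) - \sum_{x_{1..n}\in\mathcal{X}^n\setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log\mu(x_{1..n}) \\
&\le o(n) + \mu(\mathcal{X}^n \setminus T^n_{j^n_\mu})\, n \log|\mathcal{X}| = o(n),
\end{aligned} \tag{28}$$

where in the second inequality we have used $-\log\varepsilon^n_\mu = o(n)$, $d_n(\mu,\rho) = o(n)$ and $\delta_n(\mu) = o(n)$, in the last
inequality we have again used the fact that the entropy is maximized when all events are equiprobable, while
the last equality follows from (25). Moreover, from (18) we find

$$II \le \log 2 - \sum_{x_{1..n}\in\mathcal{X}^n\setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log\frac{\gamma(x_{1..n})}{\mu(x_{1..n})} \le 1 + n\,\mu(\mathcal{X}^n \setminus T^n_{j^n_\mu}) \log|\mathcal{X}| = o(n), \tag{29}$$

where in the last inequality we have used $\gamma(x_{1..n}) = |\mathcal{X}|^{-n}$ and $\mu(x_{1..n}) \le 1$, and the last equality follows
from (25).

From (27), (28) and (29) we conclude $\frac{1}{n} d_n(\mu,\nu) \to 0$.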

Step r: the regularizer $\gamma$. It remains to show that the i.i.d. regularizer $\gamma$ in the definition of $\nu$ (18) can
be replaced by a convex combination of countably many elements from $C$. Indeed, for each $n \in \mathbb{N}$, denote

$$A_n := \{x_{1..n} \in \mathcal{X}^n : \exists \mu \in C\ \ \mu(x_{1..n}) \ne 0\},$$

and let for each $x_{1..n} \in \mathcal{X}^n$ the measure $\mu_{x_{1..n}}$ be any measure from $C$ such that $\mu_{x_{1..n}}(x_{1..n}) \ge \frac{1}{2} \sup_{\mu\in C} \mu(x_{1..n})$.
Define

$$\gamma'_n(x'_{1..n}) := \frac{1}{|A_n|} \sum_{x_{1..n}\in A_n} \mu_{x_{1..n}}(x'_{1..n})$$

for each $x'_{1..n} \in \mathcal{X}^n$, $n \in \mathbb{N}$, and let $\gamma' := \sum_{k\in\mathbb{N}} w_k \gamma'_k$. For every $\mu \in C$ we have

$$\gamma'(x_{1..n}) \ge w_n |A_n|^{-1} \mu_{x_{1..n}}(x_{1..n}) \ge \frac{1}{2} w_n |\mathcal{X}|^{-n} \mu(x_{1..n})$$

for every $n \in \mathbb{N}$ and every $x_{1..n} \in A_n$, which clearly suffices to establish the bound $II = o(n)$ as in (29). □